1 Introduction & Set-up

Today, we will be going through The Impact of Hurricane Maria on Out-migration from Puerto Rico: Evidence from Facebook Data. The GitHub Repository can be found here. The GitHub Repository for this tutorial can be found here.

In this paper, Alexander et al use Facebook advertising data to estimate migration of Puerto Ricans to the continental United States after Hurricane Maria. Given the large scale use of Facebook in Puerto Rico, they argue that using Facebook data is appropriate in this instance. They collect Facebook demographic data every 2-3 months (they have six waves in total) from January 2017 to March 2018. Hurricane Maria took place between September 2017 and October 2017, enabling the authors to see the effect of the hurricane on the population demographics of Puerto Rico. They use a difference-in-differences model to compare the size of the Puerto Rican and continental US populations before and after the hurricane, finding that between October 2017 and January 2018, there was a 17% increase in the number of migrants from Puerto Rico the the continental United States–with the most migrating to Florida. They also find that there were more male migrants than females, with many being younger, i.e. in working age groups. They also find that after the hurricane, 1.8% of migrants returned to Puerto Rico. The findings aligned with prior research.

A few examples of instances in which the Facebook Ads Manager has previously been used include:

  1. Tracking cross-border migration in Europe during the Coronavirus pandemic
  2. Using FacebookAds for election polling
  3. To create estimates to improve global development statistics
  4. To create global COVID-19 Surveys to aid in policy-making which was presented at the CSDE seminar series last year; the presentation can be viewed on YouTube here.

In this R Tutorial, we will be using the data Alexander et al. provide on their GitHub for replicability. We will not be extracting new data from the Facebook API. However, if you are interested in learning how to do that, this is the general process:

  1. Access the Web Interface of the Marketing API here and define relevant attributes.
  2. Create a developer account and obtain an access token as a Facebook developer.
  3. Access Facebook data that is publicly available.

For more information on how to extract data from the Facebook Ads Manager API using Python see this guide and the information here, here, and here. The Facebook Developers website provides comprehensive directions. When using access the Web Interface of the Marketing API using Python, you will need to install pySocialWatcher, which can be found on GitHub.

1.1 Load in Necessary Libraries

First, we will download the necessary libraries.

## load libaries
library(kableExtra)      # for making pretty tables
library(RColorBrewer)    # for color palettes
library(lubridate)       # for time data
library(rnaturalearth)   # for world shapefile
library(sf)              # for simple feature mapping
library(mapview)         # for simple interactive mapping
library(tmap)            # for interactive mapping
library(tidyverse)       # for data wrangling  
library(viridis)         # for plot colors
library(migest)          # for migration visualization 
library(dplyr)           # because Aryaa's quirky R needs her to explicitly call it for it read in pipes 

2 Facebook Data

2.1 Loading in the Facebook Data

Next, we will load in the FB data from GitHub, wrangle the data into the format we want, and add in wave information.

## load the raw facebook data
#fb.raw <- read_csv("./data/facebook/master_acs_facebook_data.csv")  # if you've downloaded or cloned the github repo, otherwise
fb.raw <- read_csv("https://raw.githubusercontent.com/MJAlexander/fb-migration-hurricane-maria/master/data/facebook/master_acs_facebook_data.csv") 

dim(fb.raw)  # 25,200 rows, 30 columns

## clean & formatting the data
fb <- fb.raw %>%
  # extract age group from character string (ex ages_15_19)
  mutate(age_group = as.numeric(unlist(lapply(strsplit(age_group, "_"), '[[', 2)) ) ) %>%
  #selecting relevant variables 
  select(state, origin, age_group, sex, expat_population_wave1:facebook_population_wave7) %>%
  # add wave indicator
  pivot_longer(5:18, names_to="var", values_to="value") %>% 
  mutate(wave = as.numeric(substr(var, str_length(var), str_length(var)) ),
         var = substr(var, 1, str_length(var)-6)) %>%
  pivot_wider(names_from=var, values_from=value) %>%
  # "fix" state names (states with a space in the name, e.g. new york, are coded with an underscore in the name, e.g. new_york)
  mutate(state = ifelse(grepl("_", state), str_replace(state, "_", " "), state))
  
## add in corresponding dates for each wave: 
##   requires lubridate package (part of the tidyverse, for date/time data, but need to separately load library)
#require(lubridate)
fb <- fb %>% 
  mutate(date = case_when(
    wave==1~ymd(20170101),
    wave==2~ymd(20170401),
    wave==3~ymd(20170601),
    wave==4~ymd(20171001),
    wave==5~ymd(20180101),
    wave==6~ymd(20180301),
    wave==7~ymd(20180701)
  ))
## look at cleaned fb data
#dim(fb)  # 176,400 rows, 8 columns: 7 waves x 2 sex x 9 age groups x 50 states x 28 origin countries
   head(fb)# note the date format on the date column; not numerical but in data format 
## # A tibble: 6 x 8
##   state   origin    age_group   sex  wave expat_population facebook_population
##   <chr>   <chr>         <dbl> <dbl> <dbl>            <dbl>               <dbl>
## 1 Alabama Australia        15     2     1               30              120000
## 2 Alabama Australia        15     2     2               30               99000
## 3 Alabama Australia        15     2     3               30               97000
## 4 Alabama Australia        15     2     4               20               90000
## 5 Alabama Australia        15     2     5               20               81000
## 6 Alabama Australia        15     2     6             1000               86000
## # ... with 1 more variable: date <date>

The FB data is in long format, and has an observation (row) for each wave-state-country of origin-sex-age group combination. There are a total of 176,400 rows: one for each combination of wave (7), sex (2), 5-year age groups (9), states (50; DC is not included), and country of origin (28).

Data note: The data from waves 6 and 7 contain a significant number of “1000”s in the expat_population column (estimated number of migrants), which is suggestive of bottom-coding. In earlier waves, it appears the corresponding bottom-code count was “20”. It is possible the large numbers of “20”s in the earlier waves, and “1000”s in later waves, may be hiding other potential data issues. In general, the FB expat_population and facebook_population estimates clearly involve some rounding.

We will now share some descriptives and visualizations to better understand the clean data. Each new ‘wave’ of data is collected on the first of the month, every couple of months.

Because the data contains an observation for each state-country of origin-sex-age group combination, and there is evidence of bottom coding and data suppression, the estimates below are biased. This is especially obvious from the estimates for waves 6 and 7.

2.2 Total US population

Alexander et al use FB data to estimate the size of the US population. From their estimations, there seems to be a general decrease in the coverage of the US population from Wave 1 to Wave 7, with an particular decrease in Wave 5. These estimations are clearly an undercount compared to population estimates by the US Census Bureau.

In 2017, the US population was around 325 million according to the US Census Bureau American Community Survey and Population Estimates Program.

## estimate total facebook population by wave
# filtering by Mexico but it doesn't matter - all countries are the same total population
# (i.e. the facebook_population count is unique only for each specific combination of age-sex-state-wave)

# this is the total (fb) population per state-age-sex-wave combination
total_fb_pop <- fb %>% filter(origin=="Mexico") %>% group_by(wave, date) %>% summarise(fb = sum(facebook_population))

#creating a table of the total fb pop
kable(total_fb_pop, 
      format.args=list(big.mark=','), col.names=c("wave", "date", "estimated US pop")) %>%
  kable_paper()
wave date estimated US pop
1 2017-01-01 184,755,000
2 2017-04-01 173,047,000
3 2017-06-01 172,443,900
4 2017-10-01 169,208,700
5 2018-01-01 164,848,800
6 2018-03-01 171,468,000
7 2018-07-01 172,371,000
#using ggplot to visualize fb pop estimations by wave
ggplot(data=total_fb_pop, aes(x=wave, y=fb/1e6)) +
  geom_line()+
  geom_point()+
  labs(title="Estimated US Population Using FB data",x="Waves", y = "Population Estimate in millions") +
  scale_x_continuous(name = "wave", breaks=seq(1:7)) +
  theme(panel.grid.minor.x = element_blank())

# alternatively the estimated fb population can be calculated using the following
fb %>%
  # remove duplicate entries
  distinct(state, age_group, sex, wave, facebook_population, date) %>%
  # calculate estimated population by wave
  group_by(wave) %>%
  summarize(facebook_population = sum(facebook_population), .groups='drop')

2.3 Total expats

Similarly, the Facebook data can be used to estimate the size of the immigrant population in the US (Facebook calls this the expat population). As is evident from the table below, the number of migrants in the US is steady from Wave 1 to Wave 5. There is a decrease from Wave 4 to Wave 5. Wave 6 and Wave 7 are more similar to each other.

#looking at migrant count by date: helps us see what 'dates' are what 'waves'
mig_count <- fb %>% group_by(wave, date) %>% summarise(n = sum(expat_population))
kable(mig_count, 
      format.args=list(big.mark=','), col=c("Wave", "Date", "Count")) %>%
  kable_paper()
Wave Date Count
1 2017-01-01 16,895,070
2 2017-04-01 16,760,020
3 2017-06-01 17,107,930
4 2017-10-01 15,406,610
5 2018-01-01 16,971,970
6 2018-03-01 37,461,200
7 2018-07-01 37,766,900

2.4 Expats by age

Now, we will examine the age patterns of migrants within the different waves. The Facebook data only contains information on individuals between the ages of 15 to 59. As per the table and graph below, of all the migrants in the Facebook data, there were more migrants in their mid-20s and mid-30s compared to younger (teenagers) or older (past 50) populations. All of the waves had similar age patterns.

#looking at how many people who migrated were of what age group generally 
age_comparison <- fb %>% group_by(age_group, wave, date) %>% summarise(n = sum(expat_population))

kable(age_comparison %>%
        mutate(wave = paste("wave", wave, date, sep="_")) %>%
        select(-date) %>%
        pivot_wider(names_from=wave, values_from=n), 
      format.args=list(big.mark=','), 
      caption="estimated age distribution of the migrant population in the US, based on FB advertising data") %>%
  kable_paper()
Table 2.1: estimated age distribution of the migrant population in the US, based on FB advertising data
age_group wave_1_2017-01-01 wave_2_2017-04-01 wave_3_2017-06-01 wave_4_2017-10-01 wave_5_2018-01-01 wave_6_2018-03-01 wave_7_2018-07-01
15 945,970 884,480 901,210 660,940 770,730 3,229,600 3,209,900
20 2,308,190 2,227,480 2,257,480 1,876,490 2,068,380 4,252,300 4,247,500
25 2,768,310 2,730,640 2,780,030 2,472,540 2,739,000 4,874,200 4,953,300
30 2,657,730 2,639,980 2,699,240 2,450,700 2,653,520 4,847,900 4,885,000
35 2,475,630 2,458,660 2,519,720 2,351,390 2,538,210 4,763,900 4,796,100
40 2,034,990 2,032,440 2,080,290 1,945,490 2,091,520 4,374,400 4,409,000
45 1,646,890 1,674,260 1,710,910 1,626,080 1,796,720 4,066,800 4,128,900
50 1,177,110 1,200,870 1,232,050 1,155,020 1,305,970 3,652,400 3,701,600
55 880,250 911,210 927,000 867,960 1,007,920 3,399,700 3,435,600
ggplot(age_comparison, aes(x=age_group, y=n, color=factor(wave)))+
  geom_line() +
  #geom_bar(position="dodge", stat= "identity", fill="#69b3a2") +
  labs(x = "Age", y = "Count", color="wave")

2.5 Expats by country of origin

The Facebook data contains estimates of expats from 28 countries. Of course, this is not representative of all of the countries migrants come from.

## unique expat groups (country of origin)
(fb.origins <- unique(fb$origin))  # there are only 28, as noted in the paper
##  [1] "Australia"   "Austria"     "Canada"      "China"       "France"     
##  [6] "Germany"     "Greece"      "Hungary"     "India"       "Indonesia"  
## [11] "Ireland"     "Israel"      "Italy"       "Japan"       "Malaysia"   
## [16] "Mexico"      "Nepal"       "Philippines" "Poland"      "Portugal"   
## [21] "Puerto Rico" "Romania"     "Russia"      "Singapore"   "South Korea"
## [26] "Spain"       "UAE"         "Vietnam"

Broken down by country of origin, we see the largest number of migrants in the US are of Mexican origins. This is true for all 7 waves (from January 2017 to July 2018). Notably, the number of Mexican migrants is significantly larger than the estimated migrant population from the other 27 origin countries.

There is a marked a marked difference in the number of migrants from Puerto Rico between Waves 3 and 4, 4 and 5, and beyond Wave 5.

## size of expat groups in the us, according to the fb data
fb.expat <- fb %>% #filter(wave <= 5) %>%  # the counts for waves 6 and 7 look very different when compared to wave 5
  group_by(origin, wave) %>% 
  summarise(expat_pop = sum(expat_population), .groups='drop') %>% 
  arrange(wave, -expat_pop) %>%
  mutate(wave = paste0("pop_wave", wave)) %>%
  pivot_wider(names_from=wave, values_from=expat_pop) 

fb.expat %>%
  kable(format.args=list(big.mark=','), caption="estimated migrant population in the US by country of origin, based on FB advertising data") %>%
  kable_paper("hover")
Table 2.2: estimated migrant population in the US by country of origin, based on FB advertising data
origin pop_wave1 pop_wave2 pop_wave3 pop_wave4 pop_wave5 pop_wave6 pop_wave7
Mexico 8,499,220 8,382,780 8,556,120 7,941,350 8,488,080 8,992,300 9,167,700
India 1,672,300 1,734,900 1,762,700 1,603,500 1,718,990 2,186,400 2,225,300
Philippines 1,276,050 1,345,540 1,373,840 1,240,100 1,346,940 1,817,800 1,858,400
Puerto Rico 1,029,590 1,058,380 1,162,570 964,530 1,235,470 1,765,600 1,816,200
China 651,730 594,080 591,150 456,350 442,500 1,122,800 1,111,300
Vietnam 612,160 607,080 622,820 570,230 641,280 1,214,800 1,239,400
Canada 444,500 437,630 430,930 364,070 457,240 1,052,500 1,019,600
South Korea 354,600 330,420 331,800 282,540 328,890 1,014,300 1,010,800
Germany 284,280 249,940 250,470 213,850 225,320 926,300 926,000
Japan 209,980 199,340 198,580 171,820 212,840 944,400 942,200
Russia 197,470 187,830 189,040 171,750 203,440 939,900 944,800
France 193,460 173,720 171,420 140,670 151,120 929,500 929,300
Italy 182,790 165,330 167,910 141,410 167,720 919,100 925,800
Poland 174,920 169,380 169,030 159,560 184,070 958,300 958,900
Indonesia 156,280 123,860 126,190 92,340 111,970 914,300 919,300
Nepal 151,170 152,070 152,730 141,940 175,770 924,500 926,800
Spain 123,120 168,330 167,850 146,870 126,120 907,600 906,500
Australia 111,870 123,270 125,220 105,300 115,120 912,800 914,500
Israel 110,500 104,720 105,530 93,550 104,180 916,500 921,400
Romania 101,390 97,880 97,680 85,780 110,360 900,800 900,700
Malaysia 68,970 62,760 62,900 52,250 77,070 900,000 900,000
Ireland 67,540 68,100 68,090 62,510 63,220 900,300 900,800
Portugal 67,140 69,720 70,540 65,320 84,670 900,400 901,200
Hungary 38,030 37,900 38,010 34,650 36,490 900,000 900,000
Singapore 35,950 35,490 35,580 31,240 51,800 900,000 900,000
UAE 35,630 35,210 35,040 31,060 68,540 900,000 900,000
Austria 26,430 26,360 26,190 24,070 24,760 900,000 900,000
Greece 18,000 18,000 18,000 18,000 18,000 900,000 900,000

Data note: There is a large difference in counts between Waves 5 and 6 (and similarity between Waves 6 and 7). This illustrates that Facebook data isn’t always stable: it can suddenly change. Alexander et al note that a limitation is the opaqueness of the Facebook algorithm or shift in privacy policies. For example, it is interesting to note that the Cambridge Analytica data scandal (where the personal data belonging to millions of Facebook users was collected without their consent) occurred in March 2018 (during Wave 6). Perhaps the increase in media attention regarding social media privacy policies spurred Facebook to change their algorithms, shift their estimates, or change how data is shared in the Ads manager (i.e. bottom-coding, rounding).

We now create a chord map to visualize the migration patterns of those in the FB data set. Examining just the 3 largest migrant groups in the US, as well as Puerto Rico, for wave 4 (October 2017), we see that

  • California and Texas are especially popular destinations for those from Mexico
  • California is a popular relocation site among those from India
  • California is the top destination among immigrants from the Philippines
  • Florida is the top destination for immigrants from Puerto Rico
#selecting relevant variables: you need these three for a chord chart
# this is wave 4 only
mig_flow <- fb %>% filter(wave == 4 & origin %in% c("Mexico", "India", "Philippines", "Puerto Rico")) %>% group_by(state, origin) %>% summarize(expat_population = sum(expat_population), .groups='drop') 

#putting our data into the mig_chord function 
mig_flow %>% mig_chord()

Focusing on just migrants from Puerto Rico, we see that in wave 4 (Oct 2017), California, Connecticut, Florida, Georgia, Illinois, Massachusetts, New Jersey, New York, North Carolina, Ohio, Pennsylvania, Texas, Virginia, and Wisconsin appear to be popular destinations.

# wave 4 pr
mig_flow4 <- fb %>% filter(wave==4, origin== "Puerto Rico" ) %>% #, state=="Texas"|state=="Florida"|state=="New Jersey"|state=="New York"|state=="Pennsylvania"|state=="Connecticut"|state=="Illinois"|state=="California"|state=="Massachusetts"|state=="Washington") %>% 
  select(origin, state, expat_population)

#putting our data into the mig_chord function 
mig_flow4 %>% mig_chord()

When we compare the migration patters to Wave 5 (January 2018), we see similar patterns. Hurriciane Maria occurs between Waves 4 and 5. While not evident in the chord maps, when we look at the numbers, we find a slight increase in Puerto Rican migrants to Florida between Wave 4 and Wave 5. In the same time frame, we see a slight decrease in migrants to California and Texas.

## wave 5
mig_flow5 <- fb %>% filter(wave==5, origin== "Puerto Rico") %>%#, state=="Texas"|state=="Florida"|state=="New Jersey"|state=="New York"|state=="Pennsylvania"|state=="Connecticut"|state=="Illinois"|state=="California"|state=="Massachusetts"|state=="Washington") %>% 
  select(origin, state, expat_population)

mig_flow5 %>% mig_chord()

2.6 WA FB total population + expat population

Here, we use the FB data to see what it says about the migrant population in Washington. First, we estimate total migrant population. Then, we take a closer look at the migrant population in Washington by country of origin.

When estimating migrant population in Washington, we find a steady increase from Wave 1 to Wave 5, while the total population in WA is declining. Then, there is a sharp increase in the migrant population between Wave 5 and Wave 6. This suggests that the total count data may not be very reliable for smaller migrant groups, illustrating a potentially significant barrier in using FB data for research. As a result, we leave out waves 6 and 7 when estimating the migration population in Washington according to country of origin. While there are slight fluctuations over time, a majority of the migrants are coming from Mexico, India, Vietnam, China, and Canada. The number of migrants coming from Mexico to Washington are significantly larger than those from other countries.

## estimated wa population by wave, according to fb
wa.fb <- fb %>%
  filter(state == "Washington") %>%
  # remove duplicate entries
  distinct(state, age_group, sex, wave, facebook_population, date) %>%
  group_by(wave, date) %>%
  summarize(facebook_population = sum(facebook_population), .groups='drop')

## estimated expat population in wa, according to fb
  # notice there's a big jump in numbers between waves 5 and 6
fb %>% 
  filter(state == "Washington") %>%
  group_by(wave) %>%
  summarize(expat_pop = sum(expat_population)) %>%
  # merge in fb population for wa
  left_join(wa.fb, by="wave") %>%
  mutate(expat.prop = round(expat_pop/facebook_population, 3)) %>%
  relocate(date, .after=wave) %>%
  kable(format.args=list(big.mark=','), caption="estimated population sizes in WA, based on FB advertising data",
        col.names=c("wave", "date", "estimated WA expat pop", "estimated WA total pop", "proportion WA expat")) %>% 
  kable_paper("hover")
Table 2.3: estimated population sizes in WA, based on FB advertising data
wave date estimated WA expat pop estimated WA total pop proportion WA expat
1 2017-01-01 459,180 4,250,000 0.108
2 2017-04-01 465,730 4,050,000 0.115
3 2017-06-01 465,890 4,040,000 0.115
4 2017-10-01 431,550 3,930,000 0.110
5 2018-01-01 445,950 3,790,000 0.118
6 2018-03-01 793,700 3,960,000 0.200
7 2018-07-01 811,500 3,890,000 0.209
## estimated expat population in wa by country of origin (waves 1-5 only)
fb %>%
  filter(state == "Washington" & wave <= 5) %>%  # looks ok(?) for can, cn, in, mx, ph, vn in waves 6, 7. there's a sizable increase in numbers between waves 5 and 6
  # calculate number of expats by country of origin, by wave
  group_by(wave, date, origin) %>%
  summarize(expat_pop = sum(expat_population), .groups='drop') %>%
  mutate(wave = paste0("wave", wave)) %>%
  pivot_wider(names_from=c(wave, date), values_from=expat_pop) %>%
  kable(format.args=list(big.mark=','), 
        caption = "estimated number of migrants in WA by country of origin, based on FB advertising data") %>%
  kable_paper("hover")
Table 2.3: estimated number of migrants in WA by country of origin, based on FB advertising data
origin wave1_2017-01-01 wave2_2017-04-01 wave3_2017-06-01 wave4_2017-10-01 wave5_2018-01-01
Australia 4,020 4,710 4,710 3,980 4,230
Austria 460 450 490 420 450
Canada 23,810 23,540 23,510 21,610 23,090
China 24,820 23,940 23,700 19,260 18,380
France 3,590 3,540 3,470 3,020 3,210
Germany 8,660 8,410 8,340 7,390 7,510
Greece 360 360 360 360 360
Hungary 520 600 610 510 550
India 56,710 60,390 60,730 57,440 60,200
Indonesia 5,730 4,980 4,920 3,880 4,530
Ireland 1,500 1,440 1,500 1,320 1,340
Israel 1,830 1,790 1,820 1,660 1,870
Italy 3,080 2,870 2,960 2,420 2,480
Japan 10,220 9,710 9,690 8,470 9,010
Malaysia 1,860 1,880 1,870 1,510 1,670
Mexico 190,800 194,100 193,600 186,100 190,600
Nepal 3,040 2,880 2,900 2,720 2,850
Philippines 46,920 51,890 51,680 46,920 49,640
Poland 1,870 1,830 1,790 1,600 1,730
Portugal 390 440 420 380 400
Puerto Rico 6,650 4,660 4,730 4,410 4,880
Romania 3,980 4,060 3,990 3,620 3,760
Russia 9,760 9,750 9,700 9,190 9,570
Singapore 1,180 1,200 1,210 1,050 1,150
South Korea 14,470 13,910 13,870 12,280 13,560
Spain 2,170 2,170 2,110 1,790 1,870
UAE 710 690 710 610 650
Vietnam 30,070 29,540 30,500 27,630 26,410

2.7 Maps

Maps are a great way to visualize migration patterns. Static maps are the most straight-forward way to showcase a map.

ggplot2

Here, we look at the data for Wave 4 (this is when Hurricane Maria took place). The maps visually show what we found earlier: that the estimated population size for migrants from Mexico is much larger than those from other countries. To attenuate the numerical difference, the logged population size is displayed on the map. It’s also important to remember that the Facebook data only covers 28 countries: it is not the case that there are no migrants from the entire African continent or Latin America.

## plot the data on expat groups
# this is from ggplot2
world_map <- map_data("world")  
#head(world_map)  # this is lat/long point data

## map of fb data for october 2017
fb.pop.counts.w4 <- fb %>% 
  filter(wave == 4) %>%  
  group_by(origin, wave) %>%
  summarize(pop = sum(expat_population), .groups='drop') %>%
  arrange(pop) 

fb.pop.counts.w4 %>%
  # need to keep all country outline data
  right_join(world_map,
            by=c("origin"="region") ) %>%
  # note that mexico has a large number while other countries are much smaller in comparison
  # might need to log scale, or set manual breaks for comparison
  ggplot() +
  geom_polygon(aes(x=long, y=lat, group=group, fill=log(pop)), color="black") +
  coord_fixed(1.3) +
  scale_fill_distiller(na.value="white", direction=1, palette="Reds", name="estimated expats \n in log(population)") +
  labs(subtitle="Country of origin of US migrants, estimated from FB data (Oct 2017)") +
  theme_void() +
  theme(legend.position="bottom")

Unlike static maps, interactive maps allow the user to zoom in and out, identify specific features, or query underlying data or indicator (i.e. socioeconomic status). Here, we created a quick and simple interactive map for exploratory data analysis, first using the mapview library, then the tmap library.

mapview

You can find more detailed instructions on how to create an interactive map with this mapview tutorial and the CRAN documentation.

## download world outline data (sf)
#require(rnaturalearth)
world <- ne_countries(scale="medium", returnclass="sf")

## create interactive map displaying fb estimates by country of origin
world %>% 
  sf::st_transform(crs=4326) %>%  
  # merge in fb population estimates to shapefile
  left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>%
  select(admin, name, pop_est, continent, region_un, subregion, geometry, wave, pop) %>%
  mapview(zcol="pop",  # which column to map
          map.types=c("OpenStreetMap", "CartoDB.Positron", "Esri.WorldImagery"),  # define which base map(s) to display
          at=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop) ),  # manually define categorical breaks
          col.regions=brewer.pal(7, "YlGn")   # specify color palette              
          )

Hovering over one of the shaded countries will give the estimated migrant population size from that country.

tmap

The tmap library can be used to create both static and interactive maps. Static maps are the default.

#require(tmap)
## create static tmap

tmap_options(check.and.fix = TRUE)

world %>% 
  sf::st_transform(crs=4326) %>%  
  left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>% 
  tm_shape() +
  tm_fill(col="pop", # which column to plot
          id="name", # what to display
          breaks=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop)),  # manually define breaks
          palette="YlGn") + # specify color palette
  tm_borders()  # add in borders

To create an interactive map, specify the mode to view using the command tmap_mode("view"). To toggle it back to static maps, use tmap_mode("plot"). The same code is used to generate the static and interactive maps.

## create interactive map displaying fb estimates by country of origin
tmap_mode("view")   # interactive mode
#tmap_mode("plot")  # static map (this is the default)

world %>% 
  sf::st_transform(crs=4326) %>%  
  left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>% 
  tm_shape() +
  tm_fill(col="pop", # which column to plot
          id="name", # what to display
          breaks=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop)),  # manually define breaks
          palette="YlGn") + # specify color palette
  tm_borders()  # add in borders

Clicking on a country will generate a popup that shows the estimated size of the migrant population from that country.


3 American Community Survey (ACS) data

The code provided by Alexander et al on the GitHub repo indicates they downloaded the 2017 1-year ACS data with sex, age, birthplace (bpl), and state (statefip) variables from IPUMS. They then perform some data cleaning (the code needed for cleaning the data is available on their Github repo).

We read in the pre-saved ACS 2017 1 year data here:

# read in pre-saved ACS 2017 1-year data
#acs_prop_age <- read_csv("./data/acs_prop_age.csv")  # if cloned/downloaded the github repo, otherwise
acs_prop_age <- read_csv("https://raw.githubusercontent.com/MJAlexander/fb-migration-hurricane-maria/master/data/acs_prop_age.csv")

4 Comparing FB and ACS data

Alexander et al note that Facebook data are not representative of the broader population and it is difficult for the findings to be generalizable. However, the ACS is designed to be nationally representative. Therefore, the comparison of the FB estimates with the ACS estimates enables us to assess this potential bias.

4.1 National level differences

It is important to recall that the FB data encapsulates only 28 different countries. When looking at this data, we see that there are differences in the number of estimated migrants when compared to that of the ACS. There are 10 countries where the FB estimate is greater than the ACS estimate. For the remaining 18 countries, the ACS estimate is larger than the FB estimate. This indicates a misalignment between the data sources.

## national level estimates of number of "expats" in the acs, compared to fb estimates
acs.fb.comp <- acs_prop_age %>%
  # calculate number of estimated migrants by country of birth  
  group_by(origin, bpl) %>%
  summarize(acs.est = sum(no_mig), .groups='drop') %>%
  select(-bpl) %>%
  # keep only the countries available in the fb data
  filter(origin %in% fb.origins) %>%
  # merge in fb expat data estimates
  right_join(fb.expat %>% select(1:5) %>%  # waves 1-4 "technically" more comparable since covers 2017
               rowwise() %>% mutate(fb.avg = mean(c_across(cols=2:5))) %>% ungroup() , # compute mean estimate over the 4 waves
             by="origin") %>%
  # compute difference between acs and fb estimates
  mutate(diff = acs.est - fb.avg) %>%
  arrange(diff) %>%
  relocate(fb.avg, .after=acs.est) %>%
  relocate(diff, .after=fb.avg)

kable(acs.fb.comp %>% select(1:4), format.args=list(big.mark=',')) %>%
  kable_paper("hover")
origin acs.est fb.avg diff
Indonesia 70,611 124,667.5 -54,056.5
Nepal 97,594 149,477.5 -51,883.5
Spain 102,006 151,542.5 -49,536.5
UAE 16,201 34,235.0 -18,034.0
Australia 101,345 116,415.0 -15,070.0
France 157,959 169,817.5 -11,858.5
Hungary 29,668 37,147.5 -7,479.5
Malaysia 55,461 61,720.0 -6,259.0
Austria 19,800 25,762.5 -5,962.5
Singapore 30,028 34,565.0 -4,537.0
Italy 168,926 164,360.0 4,566.0
Ireland 73,093 66,560.0 6,533.0
Israel 111,919 103,575.0 8,344.0
Romania 118,938 95,682.5 23,255.5
Puerto Rico 1,087,557 1,053,767.5 33,789.5
Portugal 106,827 68,180.0 38,647.0
Greece 82,858 18,000.0 64,858.0
Philippines 1,416,830 1,308,882.5 107,947.5
Poland 289,660 168,222.5 121,437.5
Canada 575,457 419,282.5 156,174.5
Japan 363,601 194,930.0 168,671.0
Vietnam 1,001,298 603,072.5 398,225.5
South Korea 817,988 324,840.0 493,148.0
Germany 757,517 249,635.0 507,882.0
Russia 803,607 186,522.5 617,084.5
India 2,541,235 1,693,350.0 847,885.0
China 1,917,860 573,327.5 1,344,532.5
Mexico 9,846,770 8,344,867.5 1,501,902.5

Note: in the table above, positive values indicate a larger ACS estimate and negative values suggest larger Facebook values.

## map the results, to see if potential regional differences
world %>% left_join(acs.fb.comp, by=c("admin"="origin")) %>%
  select(admin, name, pop_est, continent, region_un, subregion, geometry, acs.est, fb.avg, diff) %>%
  mapview::mapview(zcol="diff",  # which column to map
                   map.types=c("OpenStreetMap", "CartoDB.Positron", "Esri.WorldImagery"),  # define which base map(s) to display
                   at=c(min(acs.fb.comp$diff), -1e4, 0, 1e5, 5e5, 1e6, max(acs.fb.comp$diff)),  # manually define categorical breaks
                   col.regions=brewer.pal(6, "RdBu")
                   )

4.2 State level differences

When examining differences in the PR expat population by state, the authors exclude North Carolina from the analysis. They explain that it is in an anomaly. They also examined only states with 18,000+ PRs to avoid rounding issues. For these 11 states, while the ACS and Facebook data do yield different migrant population estimates at the state level, most of them are comparable. However, the disparity between the two data sets are slightly greater (indicating a greater misalignment) for Connecticut and New Jersey. The states with the largest PR populations are Florida, New York, and Pennsylvania.

# 4. Top states based on PR population --------------------------------

# table of expat populations and proportions in facebook and acs
# remove North Carolina because there seems to be an anomaly in the data 
# filter to include only states above 18000 to avoid rounding issues. 

pr.state <- fb %>% 
  # authors do not include north carolina in their analysis for some reason (??)
  filter(!state %in% c("North Carolina")) %>% 
  # keep fb wave 4 data only, and origin country = pr
  filter(wave==4, origin=="Puerto Rico") %>% 
  # merge in acs data on pr population
  left_join(acs_prop_age %>% 
              filter(origin == "Puerto Rico"),
            by=c("state", "origin", "age_group", "sex")) %>%   
  # calculate for each state:
  group_by(state) %>% 
  summarise(expat_population = sum(expat_population),  #  total fb pr expat population
            total_pop = sum(facebook_population),   # total fb population 
            expat_population_acs = sum(no_mig),  # total acs pr migrant pop
            expat_proportion_acs = round(sum(prop_pop), 3)  # proportion of migrants in state
            ) %>% 
  # calculate fb expat proportion (out of all fb users)
  mutate(expat_proportion = round(expat_population/total_pop, 3)) %>% 
  select(state, expat_population, expat_population_acs, expat_proportion, expat_proportion_acs) %>% 
  # arrange in decreasing numbers of fb pr migrants by state
  arrange(-expat_population) 

# states where pr expat pop > 18000 (12, if north carolina not dropped)
pr.state %>% 
  filter(expat_population>18000) %>%
  kable(format.args=list(big.mark=',')) %>%
  kable_paper("hover")
state expat_population expat_population_acs expat_proportion expat_proportion_acs
Florida 304,100 302,931 0.028 0.052
New York 107,000 131,612 0.009 0.022
Pennsylvania 91,800 100,303 0.014 0.027
Massachusetts 71,800 88,683 0.020 0.042
Texas 57,160 52,574 0.004 0.006
Connecticut 47,220 63,956 0.028 0.059
New Jersey 42,640 78,691 0.012 0.029
Illinois 23,590 26,677 0.003 0.007
Ohio 22,820 25,581 0.004 0.007
California 19,500 23,873 0.001 0.002
Georgia 19,110 19,892 0.003 0.006

4.3 Age distribution

(Figure 1 from the paper)

When examining the age distribution of male migrants in 9 states, it is evident that the ACS has an older age distribution while the Facebook distributions are greater for the under-30 age groups. The ACS trends differ in all age groups.

# 7. age change -----------------------------------------------------------

## compare age distributions (figure 1 in the paper)
acs_prop_age %>% 
  # pr individuals in the acs only
  filter(origin=="Puerto Rico") %>% 
  # merge in fb data on pr expats from wave 1 (jan 2017)
  left_join(fb %>% filter(wave==1, origin=="Puerto Rico")) %>% 
  # calculate proportion of expats in each age group out of total fb pop per sex-state combination
  group_by(state, sex) %>% 
  mutate(expat_prop = expat_population/sum(expat_population, na.rm = T)) %>% 
  ungroup() %>%
  select(state, sex, age_group, prop_mig, expat_prop) %>% 
  # look at 9 selected states, males
  filter(sex==1, state %in% c("Florida", "New York", "New Jersey", 
                              "Pennsylvania", "Connecticut", "Illinois",
                              "California", "Texas", "Massachusetts")) %>% 
  ggplot(aes(age_group, prop_mig)) + 
  geom_line(lwd = 0.8) + 
  geom_line(aes(age_group, expat_prop), color = "red", lty = 2, lwd = 0.8) + 
  facet_wrap(~state)+
  theme_bw(base_size = 14) + 
  ylab("proportion") + xlab("age group")

(FB data represented by the red dashed line; ACS data by the solid black line.)

4.4 Sex ratio

The following Table is Table 1 in the paper. It compares the sex ratio estimated from the 2017 ACS with the FB data from wave 1 (Jan 2017). The Facebook sex ratios are generally larger than the ACS sex ratios, indicating higher proportions of men than women. The visualization compares the difference in estimated sex ratios by data source for selected states.

# 8. sex ratios -----------------------------------------------------------

#https://stats.stackexchange.com/questions/21167/standard-error-of-a-ratio

## sex ratios in ACS vs fb, by state  
sr.nat <- acs_prop_age %>% 
  mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>% 
  group_by(state, pr, sex) %>%
  summarise( no_mig = sum( no_mig, na.rm = T)) %>% 
  group_by(state, pr) %>% 
  # acs sex ratio (2017)
  summarise(sr_acs = no_mig[sex==1]/no_mig[sex==2]) %>% 
  group_by(state) %>% 
  filter(state %in% c("Florida", "New York", "New Jersey", 
                      "Pennsylvania", "Illinois", "California", "Connecticut",
                      "Texas", "Massachusetts", "Ohio", "Georgia")) %>% 
  left_join(fb %>% 
              mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>% 
              mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>% 
              group_by(wave, state, pr, sex) %>%
              summarise(expat_population = sum(expat_population, na.rm = T)) %>% 
              group_by(wave, state, pr) %>% 
              summarise(sr_fb = expat_population[sex==1]/expat_population[sex==2]) %>%  # fb sex ratio (m:f)
              filter(state %in% c("Florida", "New York", "New Jersey", 
                                  "Pennsylvania", "Illinois", "California", "Connecticut",
                                  "Texas", "Massachusetts", "Ohio", "Georgia"), wave<6)) %>%  ungroup() %>%
  filter(pr==1, wave==1) %>% select(-pr, -wave)  # oct 2017

## table comparing sex ratio in acs data vs fb data (table 1 in the paper)
sr.nat %>%
  kable(digits=3, col.names=c("state", "ACS sex ratio", "Facebook sex ratio")) %>%
  kable_paper()
state ACS sex ratio Facebook sex ratio
California 1.298 0.937
Connecticut 0.854 1.352
Florida 1.015 1.187
Georgia 1.093 1.238
Illinois 1.094 1.143
Massachusetts 0.883 1.251
New Jersey 1.020 1.202
New York 0.905 1.194
Ohio 1.027 1.151
Pennsylvania 0.979 1.175
Texas 1.067 1.013
## plot difference in sex ratio by data source, for select states
sr.nat %>%
  pivot_longer(2:3, names_to="source", values_to="sr") %>%
  separate(source, into=c(NA, "source")) %>%
  arrange(source, sr) %>%
  ggplot(aes(y=fct_reorder2(state, source, sr), # sort values
             x=sr, color=source)) +
  geom_point() +
  geom_vline(xintercept=1, linetype=2) +
  labs(x="sex ratio", y="state") +
  scale_y_discrete(limits=rev) +  # for reversing the order of the states in the plot
  theme_bw() +
  theme(legend.position="bottom")


5 Analysis: Difference-in-Differences estimation

Difference-in-differences models are panel data methods applied when we are attempting to understand the effect of a particular treatment. Difference-in-differences models are used when some groups are exposed to a treatment and others are not. In this study, Alexander et al used difference-in-differences to estimate the percent change in Puerto Rican migrants in the continental US after Hurricane Maria. They choose this method to overcome issues of non-representativness and fluctuations in the data over time (indpendent of migration events).

Difference-in-difference models haven been be used to study the effects of a particular policy, in papers focused on the economics of education, and labor economics.

In this paper, the diff-in-diff estimator is \[ \pi_{g} = \frac{P_{g}^{PR}(5) - P_{g}^{PR}(4)}{P_{g}^{PR}(4)} - \frac{P_{g}^{c}(5) - P_{g}^{c}(4)}{P_{g}^{c}(4)} \] where \(g\) is the group of interest (age, sex, country of origin, state of residence). The authors are interested in assessing the difference in the size of the migrant population from Puerto Rico between waves 4 and 5, relative to the change in the size of the migrant population from all other countries, sans Mexico.

The diff-in-diff approach assumes the control group, which is not subject to the treatment (here, Hurricane Maria), will continue to exhibit similar responses, relative to the treatment group. The difference between groups is the effect of the treatment. The treatment group is Puerto Rican migrants, while the control group is composed of migrants from all other countries aside from Mexico, which is assumed to follow different migration motivations. Effectively, in using this estimator, we are comparing the change in the size of the PR migrant population in the US between waves 4 and 5 with the change in population for all other migrants (but not Mexican immigrants), also between waves 4 and 5.

5.1 National diff-in-diff estimate of out-migration from Puerto Rico

Using the difference-in-difference model, Alexander et al find a 17% increase in the number of Puerto Rican migrants present in the continental US between October 2017 and January 2018.

exp_prop_usa <- fb %>% 
  # exclude migrants from puerto rico or mexico
  filter(origin != "Puerto Rico", origin != "Mexico") %>%
  # calculate number of expats in each wave
  group_by(wave) %>% 
  summarise(expat = sum(expat_population)) %>% 
  # add in total number of fb users per wave
  left_join(total_fb_pop, by="wave") %>% 
  # calculate proportion of expats out of total fb population
  mutate(prop = expat/fb) %>% 
  # merge this information back into fb data
  left_join(fb %>% 
              # calculate number of pr expats in the us, per wave
              filter(origin=="Puerto Rico") %>% 
              group_by(wave) %>% 
              summarise(expat_pr = sum(expat_population)),
            by="wave") %>% 
  # calculate proportions
  mutate(prop_pr = expat_pr/fb,  # proportion of pr expat users among total fb users
         var_prop = prop*(1-prop)/(fb/1000),  # binomial variance for all other migrants (excludes mx) 
         var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>%  # binomial variance for pr migrants
  # calculate diff in diff  
  mutate(prop_diff = (prop - lag(prop))/lag(prop),  # calculate difference in proportion of all other migrants between waves
         se_diff = sqrt(var_prop + lag(var_prop)),  
         prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr),  # calculate difference in pr migrants
         se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),  
         diff_in_diff = (prop_pr_diff - prop_diff)*100,  # calculate diff-in-diff estimator, x100 to get pct
         se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100)  # calculate se for diff-in-diff estimator

# pull out diff-in-diff estimator (+se, 95% ci) between waves 4 and 5
res <- exp_prop_usa %>% filter(wave==5) %>% 
  select(percent_increase = diff_in_diff, se = se_diff_diff) %>% 
  # calculate ~95% ci
  mutate(ci_lower = percent_increase - (2*se), 
         ci_upper = percent_increase + (2*se))

res %>%
  kable(digits=2) %>%
  kable_paper()
percent_increase se ci_lower ci_upper
17.03 0.07 16.88 17.18

5.2 National estimate of PR out-migrants to the US

The authors then take the estimate percent changes obtained from the difference-in-differences model and multiplies them by the relevant populations reported in the ACS. This enables us to estimate the change in the number of migrants.

\[\hat{M}_{g} = \hat{\pi}_{g} \times P_{g}^{ACS}\] where \(\hat{M_{g}}\) is the estimated number of migrants in group \(g\), \(\hat{\pi}_{g}\) is the proportional change in PR migrants in group \(g\), and \(P_{g}^{ACS}\) is the estimated number of PR migrants in group \(g\) in the ACS.

(The 2017 ACS indicates there were 1,087,557 Puerto Ricans residing in the US.)

acs_prop_age %>% 
  filter(origin=="Puerto Rico") %>% 
  # calculate mean number of migrants, using previously calculated percentage of movers and 95% bounds
  summarise(mean = sum(no_mig)*res$percent_increase/100, 
            lower = sum(no_mig)*res$ci_lower/100, 
            upper = sum(no_mig)*res$ci_upper/100) %>%
  kable(format.args=list(big.mark=",")) %>%
  kable_paper()
mean lower upper
185,183.4 183,567.5 186,799.4

5.3 State diff-in-diff estimates of in-migrants from PR

The authors then repeat the same approach at the state level, presenting results for states where the estimated PR population in the FB data exceeds 18,000.

Data note: In the code, expat_pop = 20 seems to be some kind of bottom-code or placeholder for missing data or small counts (20 is the minimum count in the data). The authors only examine counts greater than 20 for their analysis, but this is not explained in the paper or the code. This may be due to limited documentation with the Facebook data. Additionally, North Carolina is also excluded from the analysis (there isn’t an explanation as to why).

# 5a. State by state analysis (table) ----------------------------------------------

## calculate approximate population per state, by state and wave, using fb data
total_fb_pop_state <- fb %>% filter(origin=="Mexico") %>% 
  group_by(wave, state) %>% 
  summarise(fb = sum(facebook_population), .groups='drop')

## calculate expat population proportions by waves
ex_props <- fb %>%
  filter(!state %in% c("North Carolina")) %>%  # exclude north carolina (?)
  # exclude all expats from puerto rico, mexico, where expat_pop = 20
  filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>20) %>%
  # calculate for each state and fb wave, total number of expats from all origins besides mx, pr
  group_by(state, wave) %>%
  summarise(expat = sum(expat_population), .groups='drop') %>%
  # merge back fb population by state
  left_join(total_fb_pop_state,
            by=c("state", "wave")) %>% 
  # calculate proportion of fb expats in each state
  mutate(prop = expat/fb) %>% 
  group_by(state) %>%
  # merge in data on pr migrants from the fb data
  left_join(fb %>%
              filter(origin=="Puerto Rico") %>%
              # calculate number of pr expats in each state, by wave
              group_by(state, wave) %>%
              summarise(expat_pr = sum(expat_population), .groups='drop'),
            by=c("state", "wave") ) %>% 
  # calculate for each state
  mutate(prop_pr = expat_pr / fb,  # proportion of pr expats out of total fb users
         var_prop = prop*(1-prop)/(fb/1000),  # variance for proportion of expats out of total fb users 
         var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>%  # variance for proportion of pr users out of total fb users
  # calculate diff in diff estimator for each state
  # this will throw some warning messages for several states (ak, nd, sd, vt, wy)
  # (from the huge difference in expat_pop between waves 5 and 6, 6 and 7)
  group_by(state) %>%
  mutate(prop_diff = (prop - lag(prop))/lag(prop),  # difference in proportion of expats between waves
         se_diff = sqrt(var_prop + lag(var_prop)),  
         prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr),  # difference in pr expats between waves
         se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),  
         diff_in_diff = (prop_pr_diff - prop_diff)*100,  # diff-in-diff estimator * 100 (to get percent)
         se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) %>%  
  ungroup()

# table with percent increases and numbers
ex_props %>%  
  filter(expat_pr>18000, !is.na(diff_in_diff), wave==5) %>% 
  select(state, prop_pr, diff_in_diff, se_diff_diff) %>% 
  mutate(percent_increase = round(diff_in_diff, 1),  
         percent_lower = percent_increase - (2*se_diff_diff), 
         percent_upper = percent_increase + (2*se_diff_diff)) %>%   
  select(state, percent_increase, percent_lower, percent_upper) %>% 
  # arrange states by large -> small pct increase in pr expats  
  arrange(-percent_increase) %>% 
  # add in acs data
  left_join(acs_prop_age %>% 
              filter(origin=="Puerto Rico") %>% 
              # calculate nubmer of pr migrants by state
              group_by(state) %>%  
              summarise(mig = sum(no_mig))) %>% 
  # calculated estimated number of pr migrants in the us using fb pct estimate and acs pop est
  mutate(estimated_number = round(mig*percent_increase/100),
         number_lower = round(mig*percent_lower/100),
         number_upper = round(mig*percent_upper/100)) %>% 
  select(-mig) %>% 
  arrange(-estimated_number) %>%
  kable(format.args=list(big.mark=','), digits=2) %>%
  kable_paper("hover")
state percent_increase percent_lower percent_upper estimated_number number_lower number_upper
Florida 21.6 20.91 22.29 65,433 63,342 67,525
New York 11.0 10.32 11.68 14,477 13,584 15,371
Pennsylvania 13.4 12.66 14.14 13,441 12,700 14,181
Connecticut 14.7 12.89 16.51 9,402 8,244 10,560
Massachusetts 10.1 8.82 11.38 8,957 7,824 10,090
Texas 10.8 10.37 11.23 5,678 5,452 5,904
Ohio 12.8 12.22 13.38 3,274 3,125 3,424
Illinois 9.9 9.15 10.65 2,641 2,441 2,841
Georgia 13.1 12.41 13.79 2,606 2,470 2,742
New Jersey 2.9 1.56 4.24 2,282 1,228 3,336
California 2.4 1.86 2.94 573 444 702

The estimates suggest post Hurricane Maria, many Puerto Ricans were moving to the states that have an existing population of Puerto Rican migrants. This can indicate community based migration/chain migration.

5.3.1 Map

(Figure 2 in the paper)

The largest increase of Puerto Rican migrants is in Florida.

# 5b. State by state analysis (map) ---------------------------------------

to_map <- ex_props %>%  
  # keep states/waves with over 18k pr fb expats
  filter(expat_pr>18000) %>% 
  # keep results from wave 5 (estimates difference between waves 4, 5)
  filter(!is.na(diff_in_diff), wave==5) %>% 
  select(state, prop_pr, diff_in_diff) %>% 
  mutate(percent_increase = round(diff_in_diff, 1)) %>% 
  select(state, percent_increase) %>% 
  arrange(-percent_increase) 

## load us map outline (part of ggplot2)
states_map <- map_data("state")

# create map for illustrating percent increase by state (choropleth map - this is figure 2 in the paper)
states_map %>% 
  # merge in data on percent increase by state
  left_join(to_map %>% 
              # format for merging with map data
              mutate(region = str_to_lower(state)) ) %>% 
  ggplot() + 
  geom_polygon(aes(x = long, y = lat, fill = percent_increase, group = group), color = "black") + 
  coord_fixed(1.3) +  # define aspect ratio
  theme_void() +  # removes lat/long axes
  scale_fill_distiller(na.value = "white", direction = 1, palette = "Reds", name = "percent increase")

5.4 Age-specific estimates of out-migrants from PR to the US

(Figure 3 in the paper)

In terms of who migrated out of Puerto Rico after Hurricane Maria, there is an increase in those in the 15-29 year age range. There were fewer people above the age of 35 migrating out of Puerto Rico in a post-disaster context.

# pr pop vs all other migrant pop, according to fb data
# (non-pr includes mexico)
d_tot <- 
  fb %>% 
  mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>% 
  # change pr expat pop of those age 35-39, male in connecticut to 4300. as given in data: 20. in wave 3 the number is 4300
  mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>% 
  # calculate total pr/non-pr migrant population for each age group, by wave
  group_by(pr, age_group, wave, date) %>% 
  summarise(expat_population = sum(expat_population), .groups='drop') %>%  # 7 waves x 9 age group x 2 migrant groups = 126 unique combinations
  group_by(wave, pr) %>% 
  # calculate proportion of age group out of total fb expat (pr, non-pr) population in that wave (separate for pr, non-pr)
  mutate(exp_prop = expat_population/sum(expat_population),  
         var_prop = exp_prop*(1-exp_prop)/expat_population*10) %>%  # associated variance 
    ungroup()  
## plot results (figure 3 in the paper, but with a confidence interval)
# line plot
d_tot %>% 
  group_by(age_group) %>%
  summarise(did = ((exp_prop[wave==5&pr==1]-exp_prop[wave==4&pr==1])*100 - (exp_prop[wave==5&pr==0]-exp_prop[wave==4&pr==0])*100),
            se_did = sqrt(sqrt(var_prop[wave==5&pr==1]+var_prop[wave==4&pr==1])^2 + sqrt(var_prop[wave==5&pr==0]+var_prop[wave==4&pr==0])^2),
            lower = did - (2*se_did), upper = did + (2*se_did) ) %>% 
  ggplot(aes(age_group, did)) + geom_line() + geom_hline(yintercept = 0) + 
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) + 
  xlab("Age group") + ylab("Change (percentage points)") + 
  theme_bw(base_size = 14)

5.5 Change in sex ratio of out-migrants from PR, by state

(Table 3 in the paper)

Pre- and post-hurricane, there were larger numbers of men than women in all nine states. However, there were some states (i.e. Texas) that show an increase of male migrants, while other states (i.e. Pennsylvania) saw an increase of female migrants.

## change in sex ratio between waves in the fb data, by state (table 3)
fb %>% 
  mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>% 
  mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>% 
  left_join(total_fb_pop_state) %>% 
  group_by(wave, state, pr, sex) %>%
  summarise(expat_population = sum(expat_population, na.rm = T),  # total expat_pop by combination
            fb = fb[row_number()==1],  # keep the first value in the fb column for each wave/sex combination
            prop = expat_population/fb,  # calculate pr expat pop as proportion of total fb pop
            var_prop = prop*(1-prop)/(fb/1000)) %>% # calculate variance for the proportion
  group_by(wave, state, pr) %>% 
  summarise(sr = expat_population[sex==1]/expat_population[sex==2],   # calculate sex ratio in fb data
            se = (prop[sex==1]/prop[sex==2])^2*(var_prop[sex==1]/(prop[sex==1]*100)^2 + var_prop[sex==2]/(prop[sex==2]*100)^2)) %>%  # calculate associated se (see stacks link from sex ratio section)
  group_by(state) %>% 
  # calculate difference in sex ratio between waves 4/5 and between pr, non-pr
  mutate(sr_diff = (sr[wave==5&pr==1]-sr[wave==4&pr==1])/sr[wave==4&pr==1] - (sr[wave==5&pr==0]-sr[wave==4&pr==0])/sr[wave==4&pr==0], 
         # calculate se for difference in sex ratio
         se_diff = sqrt(sqrt(se[wave==5&pr==1]^2+se[wave==4&pr==1]^2) + sqrt(se[wave==5&pr==0]^2+se[wave==4&pr==0]^2))) %>%   
  filter(state %in% c("Florida", "New York", "New Jersey", 
                      "Pennsylvania", "Illinois", "Ohio", "Connecticut",
                      "Texas", "Massachusetts", "Georgia", "California"), wave == 4, pr==1) %>%
  mutate(sex_ratio=round(sr,3),
         change = round(sr_diff,3), 
         lower = round(sr_diff - (2*se_diff),3), 
         upper = round(sr_diff + (2*se_diff),3)) %>% 
  select(state, starts_with("sex_ratio"), change, lower, upper) %>% 
  arrange(-change) %>%
  kable() %>%
  kable_paper()
state sex_ratio change lower upper
Florida 1.163 0.080 0.076 0.083
Texas 1.008 0.065 0.059 0.071
Georgia 1.140 0.018 0.006 0.031
Connecticut 1.352 0.014 0.004 0.024
New Jersey 1.187 0.014 0.005 0.023
California 0.962 0.011 0.000 0.021
Massachusetts 1.279 0.008 0.000 0.016
New York 1.215 0.006 0.000 0.011
Ohio 1.133 0.005 -0.007 0.016
Illinois 1.127 -0.002 -0.013 0.008
Pennsylvania 1.186 -0.053 -0.059 -0.047

6 Return migration

6.1 National diff-in-diff estimate of out-migration from the US to Puerto Rico

In estimating return migration, Alexander et al specifically look at the decrease in the Puerto Rican population in the continental US after the hurricane. However, it doesn’t seem like there is enough information to suggest that an inference about where the migrants who originate from Puerto Rico move to after their time in the US is appropriate. While it is likely that it is possible that they return, this migration pattern is inferred.

Using this assumption the results indicate a 1.8 percent of Puerto Ricans that return to Puerto Rico after the hurricane.

Data note: Authors recognize that there is a rounding issue after Wave 5, thus keeping only those records with expat_population over 1000)

# 6. Return migration -----------------------------------------------------

# national
ex_props_post_nat <- fb %>%
  filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>1000) %>%  # rounding issue after wave 5
  group_by(wave) %>%
  summarise(expat = sum(expat_population)) %>%  # calculate number of expats per wave
  # merge in total fb population in the us
  left_join(total_fb_pop, by="wave") %>% 
  # calculate proportion of expats in the total fb population
  mutate(prop = expat/fb) %>% 
  left_join(fb %>%
              filter(origin=="Puerto Rico", expat_population>1000) %>%
              # caclulate fb pr expat population by wave
              group_by(wave) %>%
              summarise(expat_pr = sum(expat_population)),
            by="wave") %>% 
  mutate(prop_pr = expat_pr / fb,  # proportion of pr expats out of total fb population
         var_prop = prop*(1-prop)/(fb/1000),  # variance of expats out of total fb population
         var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>%  # variance of pr expats out of total fb population
  mutate(prop_diff = (prop - lag(prop))/lag(prop),  # calculate difference between waves (all expats, -mx)
         se_diff = sqrt(var_prop + lag(var_prop)),  # se for the difference between waves
         prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr),  # calculate difference between waves, pr only
         se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),  # se for the difference between waves
         diff_in_diff = (prop_pr_diff - prop_diff)*100,  # calculate diff-in-diff estimator (in percent)
         se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100)  # se for diff-in-diff estimator (in percent)

res_return <- ex_props_post_nat %>% 
  # examine difference between waves 5 and 6
  filter(wave==6) %>% 
  select(diff_in_diff, se_diff_diff) %>% 
  rename(percent_increase = diff_in_diff, se = se_diff_diff) %>% 
  mutate(ci_lower = percent_increase - (2*se), ci_upper = percent_increase + (2*se))

res_return %>%
  kable(digits=2, caption="estimated percentage change") %>%
  kable_paper()
Table 6.1: estimated percentage change
percent_increase se ci_lower ci_upper
-2.01 0.06 -2.13 -1.88
## use these percentages with acs data to estimate number of return migrants to pr (at national level)
acs_prop_age %>% filter(origin=="Puerto Rico") %>% 
  ungroup() %>%  
  summarise(mean = sum(no_mig)*res_return$percent_increase/100, 
            lower = sum(no_mig)*res_return$ci_lower/100, 
            upper = sum(no_mig)*res_return$ci_upper/100) %>%
  kable(format.args=list(big.mark=','), caption="estimated number of return migrants") %>%
  kable_paper()
Table 6.1: estimated number of return migrants
mean lower upper
-21,823.71 -23,166.45 -20,480.97

6.2 State diff-in-diff estimates of out-migrants from the US to PR

Making the same assumption as above, Alexander et al also find that there is a decrease in the Puerto Rican population in Florida and Texas (the states with the greatest in-migration during the hurricane). Other states like California and New York saw an increase of Puerto Rican migrants during this time.

# by state
ex_props_post <- fb %>%
  # all other expats, but pr and mx
  filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>1000) %>%  # rounding issue after wave 5
  group_by(state, wave) %>%
  summarise(expat = sum(expat_population), .groups='drop') %>%  # calculate number of expats in each state per wave
  # merge back in total fb population
  left_join(total_fb_pop_state) %>% 
  mutate(prop = expat/fb) %>%  # calculate expat proportion out of total fb population
  group_by(state) %>%
  left_join(fb %>%
              filter(origin=="Puerto Rico") %>%
              group_by(state, wave) %>%
              summarise(expat_pr = sum(expat_population))) %>%  # number of pr expats per state
  # calculate proportions
  mutate(prop_pr = expat_pr / fb,  # pr expats as a proportion of total fb population
         var_prop = prop*(1-prop)/(fb/1000),  # variance for all expats as proportion of total fb population
         var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>%  # variance for pr expats as proportion of total fb pop
  # calculate diff in diff for each state and wave
  arrange(state, wave) %>% 
  group_by(state) %>%
  mutate(prop_diff = (prop - lag(prop))/lag(prop),  # difference between waves (all expats -mx, -pr)
         se_diff = sqrt(var_prop + lag(var_prop)),  # se for difference in "all" expats between waves
         prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr),  # difference in pr expats between waves
         se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),  # se for difference in pr expats between waves
         diff_in_diff = (prop_pr_diff - prop_diff)*100,  # diff-in-diff estimator (in percent)
         se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) %>%  # se for diff-in-diff estimator
  ungroup()
# notice that only 33 states

ex_props_post %>% 
  # keep only state-wave combinations where total pr expats greater than 18k (leaves 14 states [if nc included])
  #  and diff-in-diff estimators from wave 6 (this is the difference b/w waves 5 and 6)
  filter(expat_pr>18000, !is.na(diff_in_diff), wave == 6)  %>% 
  select(state, prop_pr, diff_in_diff, se_diff_diff) %>% 
  mutate(percent_change = round(diff_in_diff, 1),
         percent_lower = percent_change - (2*se_diff_diff), 
         percent_upper = percent_change + (2*se_diff_diff) ) %>% 
        select(state, percent_change, percent_lower, percent_upper) %>% 
        arrange(percent_change) %>% 
  # merge in acs estimates of pr indivdiuals in the us by state
  left_join(acs_prop_age %>% 
              filter(origin=="Puerto Rico") %>% 
              group_by(state) %>%  
              summarise(mig = sum(no_mig))) %>%  # calculate number of pr individuals in each state
  # calculate estimated number of pr return migrants + associated ci
  mutate(estimated_number = round(mig*percent_change/100),
         number_lower = round(mig*percent_lower/100),
         number_upper = round(mig*percent_upper/100)) %>% 
  # remove acs pr estimate per state
  select(-mig) %>% 
  # arrange states in order by estimated number of pr return migrants
  arrange(estimated_number) %>%
  kable(format.args=list(big.mark=','), digits=3, 
        caption="estimated return migrants to Puerto Rico for select states") %>%
  kable_paper()
Table 6.2: estimated return migrants to Puerto Rico for select states
state percent_change percent_lower percent_upper estimated_number number_lower number_upper
Florida -7.1 -7.766 -6.434 -21,508 -23,526 -19,490
Massachusetts -4.5 -5.524 -3.476 -3,991 -4,899 -3,083
Connecticut -3.6 -5.043 -2.157 -2,302 -3,226 -1,379
Texas -3.5 -3.908 -3.092 -1,840 -2,055 -1,625
North Carolina -7.5 -7.987 -7.013 -1,442 -1,535 -1,348
Pennsylvania -1.4 -2.027 -0.773 -1,404 -2,033 -776
Ohio -1.7 -2.143 -1.257 -435 -548 -322
New York 0.4 -0.241 1.041 526 -318 1,371
Illinois 2.8 2.150 3.450 747 574 920
Georgia 6.4 5.859 6.941 1,273 1,165 1,381
New Jersey 2.3 1.154 3.446 1,810 908 2,712
California 8.5 7.965 9.035 2,029 1,902 2,157
Virginia 13.6 12.880 14.320 2,995 2,836 3,153
Wisconsin 24.5 23.930 25.070 3,049 2,978 3,120